431 Classes 09 and 10

Thomas E. Love, Ph.D.

2024-09-24

This Week’s Agenda

  • Managing the Favorite Movies Data
    • Ingesting data from a Google Sheet
    • Checking the variables
    • Some semi-sophisticated cleaning: imdb_categories
    • What makes a “good” research question?
  • Review of Independent Samples Comparisons (including ANOVA)
  • Simple Linear Regression Models
    • Fitting an OLS model
    • Performance of an OLS model
    • Checking the Fit
    • Transformations
    • What makes a “good” fit?

Load packages and set theme

library(ggrepel)        ## new: for building plots with text
library(glue)           ## new: for combining strings
library(googlesheets4)  ## new: importing Google Sheets data
library(naniar)         ## new: counting missingness
library(knitr)
library(kableExtra)     ## for neatening tables in slides
library(janitor)
library(car)            ## for boxCox function
library(infer)          ## bootstrapping
library(patchwork)
library(rstanarm)
library(easystats)
library(tidyverse)

theme_set(theme_bw())
knitr::opts_chunk$set(comment = NA)

source("c09/data/Love-431.R") # for the lovedist() function

Importing Data on Favorite Movies

gs4_deauth() # indicates to Google Drive that you're reading a public file

url <- "https://docs.google.com/spreadsheets/d/155iHDSUr8ZixX4nVcNq9HMkKbBizUhCsXHcteIdNddU"

mov1 <- read_sheet(url)
✔ Reading from "movies_2024-09-17".
✔ Range '2024-09-17 Data'.
dim(mov1)
[1] 228  18
names(mov1)
 [1] "mov_id"          "movie"           "year"            "length"         
 [5] "imdb_categories" "imdb_ratings"    "imdb_stars"      "mpa"            
 [9] "dr_love"         "mentions"        "list_20"         "list_21"        
[13] "list_22"         "list_23"         "list_24"         "imdb_synopsis"  
[17] "imdb_link"       "imdb_cat1"      

Any missing values?

n_miss(mov1)
[1] 0
miss_var_summary(mov1)
# A tibble: 18 × 3
   variable        n_miss pct_miss
   <chr>            <int>    <num>
 1 mov_id               0        0
 2 movie                0        0
 3 year                 0        0
 4 length               0        0
 5 imdb_categories      0        0
 6 imdb_ratings         0        0
 7 imdb_stars           0        0
 8 mpa                  0        0
 9 dr_love              0        0
10 mentions             0        0
11 list_20              0        0
12 list_21              0        0
13 list_22              0        0
14 list_23              0        0
15 list_24              0        0
16 imdb_synopsis        0        0
17 imdb_link            0        0
18 imdb_cat1            0        0

Looking at just our key variables

Restricting the data to our six key variables (plus the identifiers mov_id and movie):

mov2 <- mov1 |> select(mov_id, imdb_stars, imdb_ratings, length,
                       mpa, year, imdb_categories, movie)
glimpse(mov2)
Rows: 228
Columns: 8
$ mov_id          <chr> "M-001", "M-002", "M-003", "M-004", "M-005", "M-006", …
$ imdb_stars      <dbl> 8.4, 8.0, 7.3, 8.3, 7.9, 7.8, 8.5, 8.4, 7.9, 8.4, 8.4,…
$ imdb_ratings    <dbl> 442000, 127000, 397000, 730000, 58000, 396000, 978000,…
$ length          <dbl> 170, 138, 97, 149, 119, 123, 117, 160, 162, 149, 181, …
$ mpa             <chr> "PG-13", "NR", "PG-13", "G", "TV-PG", "R", "R", "PG", …
$ year            <dbl> 2009, 1963, 1999, 1968, 2009, 2013, 1979, 1984, 2009, …
$ imdb_categories <chr> "Comedy, Drama", "Epic, Psychological Drama, Showbiz D…
$ movie           <chr> "3 Idiots", "8 1/2", "10 Things I Hate About You", "20…

Assessing variables by type

  • Identification variables: mov_id and movie.
    • Are these distinct (do we have a different value of these in every row of our data)?
nrow(mov2); n_distinct(mov2$mov_id); n_distinct(mov2$movie)
[1] 228
[1] 228
[1] 228

OK.

Assessing variables by type

  • Quantities (check ranges)
mov2 |> reframe(range(imdb_stars), range(imdb_ratings), 
                range(length), range(year))
# A tibble: 2 × 4
  `range(imdb_stars)` `range(imdb_ratings)` `range(length)` `range(year)`
                <dbl>                 <dbl>           <dbl>         <dbl>
1                 3.4                   282              70          1942
2                 9.3               2900000             207          2024
  • imdb_stars = weighted average movie score (1 - 10)
  • imdb_ratings = # of users who’ve rated movie
  • length = length of movie, in minutes
  • year = year movie was released

Assessing variables by type

  • Categorical (assess levels, create factors)
mov2 |> count(mpa)
# A tibble: 9 × 2
  mpa       n
  <chr> <int>
1 G         7
2 NR       13
3 PG       62
4 PG-13    74
5 R        67
6 TV-14     1
7 TV-G      2
8 TV-MA     1
9 TV-PG     1

Dealing with mpa

  • Let’s create a four-level factor, as follows:
mov2 <- mov2 |>
  mutate(mpa = fct_lump_n(mpa, n = 3, other_level = "Other"))

mov2 |> count(mpa)
# A tibble: 4 × 2
  mpa       n
  <fct> <int>
1 PG       62
2 PG-13    74
3 R        67
4 Other    25
  • Here, fct_lump_n() with n = 3 keeps the 3 most common mpa values and collapses all other values into a new level called Other.
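As a small illustration of how fct_lump_n() behaves (using a made-up vector, not our movies data):

```r
library(forcats)  # part of the tidyverse, which we loaded earlier

x <- factor(c("a", "a", "a", "b", "b", "c", "d"))

# keep the 2 most frequent levels; pool the rest into "Other"
fct_count(fct_lump_n(x, n = 2, other_level = "Other"))
# a and b are kept (3 and 2 movies); c and d are pooled into Other
```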

Final Key Variable: imdb_categories

mov2 |> count(imdb_categories) |> arrange(desc(n))
# A tibble: 215 × 2
   imdb_categories                                                             n
   <chr>                                                                   <int>
 1 Comedy, Drama                                                               3
 2 Drama                                                                       3
 3 Adventure Epic, Desert Adventure, Globetrotting Adventure, Quest, Acti…     2
 4 Comedy                                                                      2
 5 Comedy, Drama, Music                                                        2
 6 Comedy, Drama, Romance                                                      2
 7 Drama, Romance                                                              2
 8 Feel-Good Romance, Comedy, Drama, Romance                                   2
 9 Feel-Good Romance, Romantic Comedy, Comedy, Drama, Romance                  2
10 Period Drama, Drama, Romance                                                2
# ℹ 205 more rows
  • In 228 films, we see 215 different combinations of genres in imdb_categories.

Separate imdb_categories

mov2 <- mov2 |>
  separate_wider_delim(imdb_categories, delim = ",", cols_remove = FALSE, 
                       too_few = "align_start",
                       names = c("genre01", "genre02", "genre03", "genre04",
                                 "genre05", "genre06", "genre07", "genre08",
                                 "genre09", "genre10"))

names(mov2)
 [1] "mov_id"          "imdb_stars"      "imdb_ratings"    "length"         
 [5] "mpa"             "year"            "genre01"         "genre02"        
 [9] "genre03"         "genre04"         "genre05"         "genre06"        
[13] "genre07"         "genre08"         "genre09"         "genre10"        
[17] "imdb_categories" "movie"          

Missing genre values?

miss_var_summary(mov2)
# A tibble: 18 × 3
   variable        n_miss pct_miss
   <chr>            <int>    <num>
 1 genre10            211    92.5 
 2 genre09            195    85.5 
 3 genre08            176    77.2 
 4 genre07            151    66.2 
 5 genre06            116    50.9 
 6 genre05             80    35.1 
 7 genre04             44    19.3 
 8 genre03             15     6.58
 9 genre02              5     2.19
10 mov_id               0     0   
11 imdb_stars           0     0   
12 imdb_ratings         0     0   
13 length               0     0   
14 mpa                  0     0   
15 year                 0     0   
16 genre01              0     0   
17 imdb_categories      0     0   
18 movie                0     0   

Is this a Superhero movie?

  • Create a variable called superhero which is 1 if the movie’s imdb_categories list includes Superhero and 0 otherwise.
mov2 <- mov2 |>
  mutate(superhero = as.numeric(
    str_detect(imdb_categories, fixed("Superhero"))))

mov2 |> count(superhero)
# A tibble: 2 × 2
  superhero     n
      <dbl> <int>
1         0   215
2         1    13

Our 13 “Superhero” Movies

mov2 |> select(superhero, movie) |> filter(superhero == 1)
# A tibble: 13 × 2
   superhero movie                            
       <dbl> <chr>                            
 1         1 Avengers: Infinity War           
 2         1 Avengers: Endgame                
 3         1 Big Hero 6                       
 4         1 Black Panther                    
 5         1 Captain Marvel                   
 6         1 The Dark Knight                  
 7         1 The Dark Knight Rises            
 8         1 Doctor Strange                   
 9         1 Iron Man                         
10         1 Mystery Men                      
11         1 Spider Man: Into The Spider-Verse
12         1 Thor: Love and Thunder           
13         1 V for Vendetta                   

Indicators of 12 Most Common Genres

mov2 <- mov2 |> 
  mutate(action = as.numeric(str_detect(imdb_categories, fixed("Action"))),
         adventure = as.numeric(str_detect(imdb_categories, fixed("Adventure"))),
         animation = as.numeric(str_detect(imdb_categories, fixed("Animation"))),
         comedy = as.numeric(str_detect(imdb_categories, fixed("Comedy"))),
         crime = as.numeric(str_detect(imdb_categories, fixed("Crime"))),
         drama = as.numeric(str_detect(imdb_categories, fixed("Drama"))),
         family = as.numeric(str_detect(imdb_categories, fixed("Family"))),
         fantasy = as.numeric(str_detect(imdb_categories, fixed("Fantasy"))),
         mystery = as.numeric(str_detect(imdb_categories, fixed("Mystery"))),
         romance = as.numeric(str_detect(imdb_categories, fixed("Romance"))),
         scifi = as.numeric(str_detect(imdb_categories, fixed("Sci-Fi"))),
         thriller = as.numeric(str_detect(imdb_categories, fixed("Thriller")))
  )

Genre Counts (across our 228 movies)

  • For the 12 most common genres…
mov2 |> 
  select(action:thriller) |>
  colSums()
   action adventure animation    comedy     crime     drama    family   fantasy 
       60        80        25        97        29       132        36        53 
  mystery   romance     scifi  thriller 
       22        55        44        46 

Build a tibble of genre counts

genre_counts <- mov2 |> 
  select(action:thriller) |>
  colSums() |> 
  t() |> as_tibble() |> pivot_longer(action:thriller) |>
  rename(genre = name, movies = value) |> 
  arrange(desc(movies))

The genre_counts tibble

genre_counts
# A tibble: 12 × 2
   genre     movies
   <chr>      <dbl>
 1 drama        132
 2 comedy        97
 3 adventure     80
 4 action        60
 5 romance       55
 6 fantasy       53
 7 thriller      46
 8 scifi         44
 9 family        36
10 crime         29
11 animation     25
12 mystery       22

How many movies are

  • both Romance and Comedy?
mov2 |> count(comedy, romance)
# A tibble: 4 × 3
  comedy romance     n
   <dbl>   <dbl> <int>
1      0       0   110
2      0       1    21
3      1       0    63
4      1       1    34

34 Romantic Comedies?

mov2 |> filter(comedy == 1, romance == 1) |>
  select(movie) |> slice(1:18)
# A tibble: 18 × 1
   movie                      
   <chr>                      
 1 10 Things I Hate About You 
 2 About Time                 
 3 Clueless                   
 4 Coming To America          
 5 Crazy Rich Asians          
 6 Elf                        
 7 Flipped                    
 8 Harold and Maude           
 9 High School Musical 2      
10 Jab We Met (When We Met)   
11 La La Land                 
12 Legally Blonde             
13 Life As We Know It         
14 Life Is Beautiful          
15 The Lobster                
16 Mamma Mia!                 
17 Mamma Mia: Here We Go Again
18 Monte Carlo                
mov2 |> filter(comedy == 1, romance == 1) |>
  select(movie) |> slice(19:34)
# A tibble: 16 × 1
   movie                                        
   <chr>                                        
 1 Murder Mystery                               
 2 My Big Fat Greek Wedding                     
 3 Notting Hill                                 
 4 Om Shanti Om (Peace Be With You)             
 5 The Parent Trap                              
 6 Real Genius                                  
 7 Scott Pilgrim vs. The World                  
 8 The Secret Life of Walter Mitty              
 9 Shrek                                        
10 Shrek 2                                      
11 Tangled                                      
12 Thor: Love and Thunder                       
13 Trolls                                       
14 When Harry Met Sally                         
15 Yeh Jawaani hai Deewani (Youth is Crazy)     
16 Zindagi Na Milegi Dobara (You Only Live Once)

Any romance + comedy + superhero?

mov2 |> filter(romance == 1, comedy == 1, superhero == 1) |>
  select(mov_id, movie, imdb_categories)
# A tibble: 1 × 3
  mov_id movie                  imdb_categories                                 
  <chr>  <chr>                  <chr>                                           
1 M-209  Thor: Love and Thunder Superhero, Action, Adventure, Comedy, Fantasy, …
mov2 |> filter(mov_id == "M-209") |> select(imdb_categories, mov_id)
# A tibble: 1 × 2
  imdb_categories                                                mov_id
  <chr>                                                          <chr> 
1 Superhero, Action, Adventure, Comedy, Fantasy, Romance, Sci-Fi M-209 

Note

Thor: Love and Thunder does not have the “Romantic Comedy” genre. Only nine of our movies do (About Time, Clueless, Coming to America, Crazy Rich Asians, Harold and Maude, Legally Blonde, Mamma Mia!, My Big Fat Greek Wedding, and Notting Hill).
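We can check this claim directly with the same str_detect() approach we used for the genre indicators (a quick sketch):

```r
# list the movies whose imdb_categories string includes "Romantic Comedy"
mov2 |>
  filter(str_detect(imdb_categories, fixed("Romantic Comedy"))) |>
  select(mov_id, movie)
```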

Superhero, Sci-Fi & IMDB Stars (1/4)

mov2 |> group_by(superhero) |> reframe(lovedist(imdb_stars)) |>
  kbl(digits = 1) |> kable_material(font_size = 28)
superhero    n  miss  mean   sd  med  mad  min  q25  q75  max
0          215     0   7.5  0.9  7.7  0.7  3.4  7.1  8.1  9.3
1           13     0   7.7  0.9  7.9  0.7  6.1  7.3  8.4  9.0
mov2 |> group_by(scifi) |> reframe(lovedist(imdb_stars)) |>
  kbl(digits = 1) |> kable_material(font_size = 28)
scifi    n  miss  mean   sd  med  mad  min  q25  q75  max
0      184     0   7.5  0.9  7.7  0.7  3.4  7.1  8.1  9.3
1       44     0   7.6  0.7  7.7  0.9  6.1  7.1  8.3  8.8
  • But what if a movie is in both genres?

Superhero, Sci-Fi & IMDB Stars (2/4)

What to do about the movies with “Superhero” and “Sci-Fi”?

mov2 |> filter(superhero == 1, scifi == 1) |>
  select(mov_id, movie, superhero, scifi, imdb_stars)
# A tibble: 10 × 5
   mov_id movie                  superhero scifi imdb_stars
   <chr>  <chr>                      <dbl> <dbl>      <dbl>
 1 M-010  Avengers: Infinity War         1     1        8.4
 2 M-011  Avengers: Endgame              1     1        8.4
 3 M-019  Big Hero 6                     1     1        7.8
 4 M-021  Black Panther                  1     1        7.3
 5 M-029  Captain Marvel                 1     1        6.8
 6 M-049  Doctor Strange                 1     1        7.5
 7 M-105  Iron Man                       1     1        7.9
 8 M-151  Mystery Men                    1     1        6.1
 9 M-209  Thor: Love and Thunder         1     1        6.2
10 M-217  V for Vendetta                 1     1        8.1

Superhero, Sci-Fi & IMDB Stars (3/4)

mov2 |> group_by(superhero, scifi) |> 
  reframe(lovedist(imdb_stars)) |>
  kbl(digits = 1) |> kable_material(font_size = 28)
superhero  scifi    n  miss  mean   sd  med  mad  min  q25  q75  max
0          0      181     0   7.5  0.9  7.7  0.7  3.4  7.1  8.1  9.3
0          1       34     0   7.7  0.7  7.7  0.9  6.5  7.1  8.3  8.8
1          0        3     0   8.6  0.3  8.4  0.0  8.4  8.4  8.7  9.0
1          1       10     0   7.4  0.8  7.7  0.9  6.1  6.9  8.0  8.4
  • What groups might we want to compare here?

Superhero, Sci-Fi & IMDB Stars (4/4)

mov2 |> filter(superhero == 1, scifi == 0) |>
  select(movie)
# A tibble: 3 × 1
  movie                            
  <chr>                            
1 The Dark Knight                  
2 The Dark Knight Rises            
3 Spider Man: Into The Spider-Verse
mov2 |> filter(superhero == 0, scifi == 1) |>
  select(movie) |> slice(1:14)
# A tibble: 14 × 1
   movie                                
   <chr>                                
 1 2001: A Space Odyssey                
 2 About Time                           
 3 Alien                                
 4 Avatar                               
 5 Back to the Future                   
 6 Back to the Future Part II           
 7 Blade Runner 2049                    
 8 Cloud Atlas                          
 9 Despicable Me                        
10 Divergent                            
11 Eternal Sunshine of the Spotless Mind
12 Everything, Everywhere, All at Once  
13 Face/Off                             
14 Gattaca                              
mov2 |> filter(superhero == 0, scifi == 1) |>
  select(movie) |> slice(15:34)
# A tibble: 20 × 1
   movie                                        
   <chr>                                        
 1 Gravity                                      
 2 The Hunger Games                             
 3 The Hunger Games: Catching Fire              
 4 Inception                                    
 5 Interstellar                                 
 6 Jurassic Park                                
 7 The Lobster                                  
 8 The Matrix                                   
 9 Minions: The Rise of Gru                     
10 The Platform (El Hoyo)                       
11 Real Genius                                  
12 Real Steel                                   
13 Resident Evil                                
14 Rise of the Guardians                        
15 Sorry To Bother You                          
16 Star Wars Episode III: Revenge of the Sith   
17 Star Wars: Episode IV: A New Hope            
18 Star Wars: Episode V: The Empire Strikes Back
19 Star Wars: Episode VI: Return of the Jedi    
20 War of the Worlds                            
  • Are these groups comparable?

Movie Questions for Today

  1. Do movies released in 1942-2010 have more user ratings than movies released after 2010? (imdb_ratings, year)

  2. How do movie lengths vary by MPA ratings? (length, mpa)

  3. How strong is the association between how often a movie is rated on IMDB and its number of stars? (imdb_ratings, imdb_stars)

Question 1

Do movies released in 1942-2010 have more user ratings than movies released after 2010? (imdb_ratings, year)

Numerical Summaries

mov2 <- mov2 |>
  mutate(release = fct_recode(factor(year > 2010),
                            After2010 = "TRUE", Older = "FALSE"))

mov2 |> group_by(release) |> reframe(lovedist(imdb_ratings)) |>
  kbl(digits = 1) |> kable_styling(font_size = 20)
release      n  miss      mean        sd     med       mad   min     q25     q75      max
Older      152     0  592217.1  604230.8  367000  411421.5  4800  161750  860750  2900000
After2010   76     0  482915.6  442535.7  369500  386958.6   282  122000  689500  2200000

Plotting the Two Samples

Let’s plot the # of ratings in thousands, to avoid some scientific notation.

ggplot(mov2, aes(x = imdb_ratings/1000, y = release)) +
  geom_violin(aes(fill = release)) +
  geom_boxplot(width = 0.25) +
  stat_summary(fun = mean, geom = "point", col = "red", 
               shape = 16, size = 2) +
  scale_fill_viridis_d(option = "C", alpha = 0.3) +
  guides(fill = "none") +
  labs(y = "Release Date", x = "Thousands of IMDB Ratings")

Data not close to Normal

Each sample shows right skew in the # of IMDB user ratings:

Group        Sample Mean   Sample Median
Older          592,217.1         367,000
After2010      482,915.6         369,500
Difference     109,301.5          -2,500
  • Could we use a bootstrap to compare the means without worrying much about the skew (or instead compare the medians)?
  • Could we use a non-linear transformation?
  • Let’s use an 89% uncertainty interval. (Why not?)

Bootstrap difference in means

# point estimate 
mov2 |> specify(imdb_ratings ~ release) |>
  calculate(stat = "diff in means", order = c("Older", "After2010"))
Response: imdb_ratings (numeric)
Explanatory: release (factor)
# A tibble: 1 × 1
     stat
    <dbl>
1 109302.
# 89% confidence interval
set.seed(202409241)
mov2 |> specify(imdb_ratings ~ release) |>
  generate(reps = 2500, type = "bootstrap") |>
  calculate(stat = "diff in means", order = c("Older", "After2010")) |>
  get_ci(level = 0.89, type = "percentile")
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1   -7165.  219881.

Bootstrap difference in medians

# point estimate 
mov2 |> specify(imdb_ratings ~ release) |>
  calculate(stat = "diff in medians", order = c("Older", "After2010"))
Response: imdb_ratings (numeric)
Explanatory: release (factor)
# A tibble: 1 × 1
   stat
  <dbl>
1 -2500
# 89% confidence interval
set.seed(202409242)
mov2 |> specify(imdb_ratings ~ release) |>
  generate(reps = 2500, type = "bootstrap") |>
  calculate(stat = "diff in medians", order = c("Older", "After2010")) |>
  get_ci(level = 0.89, type = "percentile")
# A tibble: 1 × 2
  lower_ci upper_ci
     <dbl>    <dbl>
1 -124943.   122500

Right Skew: Try a logarithm?

ggplot(mov2, aes(x = log(imdb_ratings), y = release)) +
  geom_violin(aes(fill = release)) +
  geom_boxplot(width = 0.25) +
  stat_summary(fun = mean, geom = "point", col = "red", 
               shape = 16, size = 3) +
  scale_fill_viridis_d(option = "C", alpha = 0.3) +
  guides(fill = "none") +
  labs(y = "Release Date", x = "log(# of IMDB Ratings)")

Box-Cox suggestion?

fit0 <- lm(imdb_ratings ~ release, data = mov2)
boxCox(fit0)

Right Skew: Try a square root?

ggplot(mov2, aes(x = sqrt(imdb_ratings), y = release)) +
  geom_violin(aes(fill = release)) +
  geom_boxplot(width = 0.25) +
  stat_summary(fun = mean, geom = "point", col = "red", 
               shape = 16, size = 3) +
  scale_fill_viridis_d(option = "C", alpha = 0.3) +
  guides(fill = "none") +
  labs(y = "Release Date", x = "sqrt(# of IMDB Ratings)")

OLS model

fit1 <- lm(sqrt(imdb_ratings) ~ release, data = mov2)

model_parameters(fit1, ci = 0.89)
Parameter           | Coefficient |    SE |            89% CI | t(226) |      p
-------------------------------------------------------------------------------
(Intercept)         |      675.78 | 28.45 | [ 630.14, 721.43] |  23.76 | < .001
release [After2010] |      -52.72 | 49.27 | [-131.78,  26.33] |  -1.07 | 0.286 
estimate_contrasts(fit1, contrast = "release", ci = 0.89)
Marginal Contrasts Analysis

Level1 |    Level2 | Difference |           89% CI |    SE | t(226) |     p
---------------------------------------------------------------------------
Older  | After2010 |      52.72 | [-26.33, 131.78] | 49.27 |   1.07 | 0.286

Marginal contrasts estimated at release
p-value adjustment method: Holm (1979)
  • Of course, these results are on the square root scale.

Bayesian model

set.seed(202409213)
fit2 <- stan_glm(sqrt(imdb_ratings) ~ release, data = mov2, refresh = 0)
model_parameters(fit2, ci = 0.89)
Parameter        | Median |            89% CI |     pd |  Rhat |     ESS |                     Prior
----------------------------------------------------------------------------------------------------
(Intercept)      | 676.41 | [ 628.71, 722.50] |   100% | 1.000 | 3717.00 | Normal (658.21 +- 877.10)
releaseAfter2010 | -54.06 | [-131.15,  25.31] | 85.60% | 1.001 | 3658.00 |  Normal (0.00 +- 1856.51)
estimate_contrasts(fit2, contrast = "release", ci = 0.89)
Marginal Contrasts Analysis

Level1 |    Level2 | Difference |           89% CI |     pd | % in ROPE
-----------------------------------------------------------------------
Older  | After2010 |      54.06 | [-25.31, 131.15] | 85.60% |     0.08%

Marginal contrasts estimated at release

Making Predictions (1/2)

  • Use our OLS model for sqrt(imdb_ratings) to make predictions on the square root scale.
estimate_means(fit1, ci = 0.89, by = "release", transform = "none") |>
  kbl(digits = 4)
release        Mean       SE    CI_low   CI_high
Older      675.7814  28.4476  630.1371  721.4257
After2010  623.0566  40.2310  558.5058  687.6074

Note that \(675.7814^2 \approx 456681\) and \(623.0566^2 \approx 388200\)

Making Predictions (2/2)

  • Use our model for sqrt(imdb_ratings) to make predictions on the original scale of imdb_ratings.
estimate_means(fit1, ci = 0.89, by = "release", 
               transform = "response") |>
  kbl(digits = 0)
release      Mean     SE  CI_low  CI_high
Older      456681  38449  397073   520455
After2010  388200  50132  311929   472804

Summary for Question 1

Do movies released in 1942-2010 have more user ratings than movies released after 2010?

Group        Sample Mean   Sample Median
Older          592,217.1         367,000
After2010      482,915.6         369,500
Difference     109,301.5          -2,500
  • Bootstrap means: diff = 109302, 89% CI (-7165, 219881)
  • Bootstrap medians: diff = -2500, 89% CI (-124943, 122500)

Question 1 Models

Do movies released in 1942-2010 have more user ratings than movies released after 2010?

  • OLS: Square root of user ratings for a movie released in 1942-2010 is, on average, 52.72 (89% CI: -26, 132) higher than for a movie released after 2010.

  • Bayes: Square root of user ratings for a movie released in 1942-2010 is, on average, 54.06 (89% CI: -25, 131) higher than for a movie released after 2010.

Question 2

How do movie lengths vary by MPA ratings? (length, mpa)

We’ll focus on PG, PG-13 and R

mov3 <- mov2 |>
  filter(mpa != "Other") |>
  mutate(mpa = fct_reorder(mpa, length, .desc = TRUE))

mov3 |> group_by(mpa) |> reframe(lovedist(length)) |>
  kbl(digits = 1) |> kable_styling(font_size = 28)
mpa     n  miss   mean    sd    med   mad  min    q25    q75  max
PG-13  74     0  128.4  25.5  123.0  24.5   90  110.2  143.0  201
R      67     0  128.0  24.4  122.0  23.7   94  110.0  144.0  189
PG     62     0  111.7  18.8  106.5  17.8   87   97.0  124.8  165

Three Independent Samples

ggplot(mov3, aes(x = mpa, y = length)) +
  geom_violin() + 
  geom_boxplot(aes(fill = mpa), width = 0.3) +
  stat_summary(fun = mean, geom = "point", size = 4, 
               shape = 23, col = "snow", fill = "royalblue") +
  scale_fill_viridis_d(option = "D", alpha = 0.5) +
  guides(fill = "none")

OLS model

fit3 <- lm(length ~ mpa, data = mov3)

anova(fit3)
Analysis of Variance Table

Response: length
           Df Sum Sq Mean Sq F value    Pr(>F)    
mpa         2  11714  5857.1  10.842 3.386e-05 ***
Residuals 200 108045   540.2                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
estimate_means(fit3, ci = 0.89)
Estimated Marginal Means

mpa   |   Mean |   SE |           89% CI
----------------------------------------
PG-13 | 128.45 | 2.70 | [124.11, 132.78]
R     | 127.96 | 2.84 | [123.40, 132.51]
PG    | 111.73 | 2.95 | [106.99, 116.46]

Marginal means estimated at mpa

Pairwise Comparisons

estimate_contrasts(fit3, contrast = "mpa", ci = 0.89, p_adjust = "Holm")
Marginal Contrasts Analysis

Level1  | Level2 | Difference |         89% CI |   SE | t(200) |      p
-----------------------------------------------------------------------
(PG-13) |     PG |      16.72 | [ 8.30, 25.14] | 4.00 |   4.18 | < .001
(PG-13) |      R |       0.49 | [-7.75,  8.74] | 3.92 |   0.13 | 0.900 
R       |     PG |      16.23 | [ 7.61, 24.85] | 4.10 |   3.96 | < .001

Marginal contrasts estimated at mpa
p-value adjustment method: Holm (1979)

Build a tibble of contrasts

con_holm <- estimate_contrasts(fit3, ci = 0.89, p_adjust = "holm")

con_holm_tib <- tibble(con_holm) |>  
  mutate(contr = str_c(Level1, " - ", Level2))

con_holm_tib
# A tibble: 3 × 10
  Level1  Level2 Difference CI_low CI_high    SE    df     t        p contr     
  <chr>   <chr>       <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl>    <dbl> <chr>     
1 (PG-13) PG         16.7     8.30   25.1   4.00   200 4.18  0.000132 (PG-13) -…
2 (PG-13) R           0.491  -7.75    8.74  3.92   200 0.125 0.900    (PG-13) -…
3 R       PG         16.2     7.61   24.8   4.10   200 3.96  0.000206 R - PG    

Plot Holm comparisons

ggplot(con_holm_tib, aes(x = contr, y = Difference)) +
  geom_point(size = 3, col = "purple") +
  geom_errorbar(aes(ymin = CI_low, ymax = CI_high)) +
  geom_hline(yintercept = 0, col = "red", lty = "dashed") +
  labs(title = "Holm 89% Intervals for Movie Length",
       x = "Contrast", 
       y = "Difference in Length")

Summary for Question 2

How do movie lengths vary by MPA ratings?

MPA     n   Sample Mean   Sample Median
PG-13   74        128.4           123.0
R       67        128.0           122.0
PG      62        111.7           106.5
  • In the sample, PG movies are shorter by 16-17 minutes.
  • Pairwise 89% contrasts show larger differences between PG and the other mpa groups than between PG-13 and R.

Simple Linear Regression

Question 3

How strong is the association between how often a movie is rated on IMDB and its number of stars? (imdb_ratings, imdb_stars)

mov2 |> reframe(lovedist(imdb_ratings)) |> 
  kbl(digits = 0) |> kable_styling(font_size = 28)
  n  miss    mean      sd     med     mad  min     q25     q75      max
228     0  555783  556984  367000  391406  282  158000  783750  2900000
mov2 |> reframe(lovedist(imdb_stars)) |> 
  kbl(digits = 2) |> kable_styling(font_size = 28)
  n  miss  mean    sd  med   mad  min  q25  q75  max
228     0  7.55  0.87  7.7  0.74  3.4  7.1  8.1  9.3

Scatterplot of 228 movies

We’ll look at stars (on the y axis) vs. ratings (in 100,000s, on x).

ggplot(mov2, aes(x = imdb_ratings/100000, y = imdb_stars)) +
  geom_point() +
  geom_smooth(method = "loess", 
              formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", 
              formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "Hundreds of Thousands of IMDB ratings",
       y = "Weighted average star rating")

Pearson Correlation

cor(mov2$imdb_stars, mov2$imdb_ratings)
[1] 0.6131559
cor(mov2$imdb_stars, (mov2$imdb_ratings/100000))
[1] 0.6131559
  • We can add any constant, or multiply by any positive constant, without changing the Pearson correlation coefficient.
cor_test(mov2, "imdb_stars", "imdb_ratings")
Parameter1 |   Parameter2 |    r |       95% CI | t(226) |         p
--------------------------------------------------------------------
imdb_stars | imdb_ratings | 0.61 | [0.53, 0.69] |  11.67 | < .001***

Observations: 228

OLS model with imdb_ratings

fit5 <- lm(imdb_stars ~ imdb_ratings, data = mov2)

model_parameters(fit5, ci = 0.89)
Parameter    | Coefficient |       SE |       89% CI | t(226) |      p
----------------------------------------------------------------------
(Intercept)  |        7.01 |     0.06 | [6.91, 7.12] | 108.44 | < .001
imdb ratings |    9.60e-07 | 8.23e-08 | [0.00, 0.00] |  11.67 | < .001

Very hard to conceptualize \(9.6 \times 10^{-7}\) in any practical context, and worse, the 89% CI for the slope of imdb_ratings displays as [0.00, 0.00] after rounding.

  • What if we rescaled the imdb_ratings maintaining the linear relationship?
    • We can add any constant, or multiply by any positive constant, we like.
    • Divide # of ratings by 100,000?

OLS with rescaled imdb_ratings

mov4 <- mov2 |>
  mutate(users_100k = imdb_ratings/100000)

fit6 <- lm(imdb_stars ~ users_100k, data = mov4)

model_parameters(fit6, ci = 0.89)
Parameter   | Coefficient |       SE |       89% CI | t(226) |      p
---------------------------------------------------------------------
(Intercept) |        7.01 |     0.06 | [6.91, 7.12] | 108.44 | < .001
users 100k  |        0.10 | 8.23e-03 | [0.08, 0.11] |  11.67 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
  • When comparing any two movies whose # of IMDB user ratings are 100,000 apart, we see a star rating that is 0.10 stars (89% CI 0.08, 0.11) higher, on average, for the movie with more user ratings, according to this model.

Performance of the OLS model

model_performance(fit6)
# Indices of model performance

AIC     |    AICc |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------
482.090 | 482.197 | 492.378 | 0.376 |     0.373 | 0.687 | 0.690
  • Key summaries for an OLS model with one predictor, like fit6, are \(R^2\) and Sigma (which is similar to RMSE).
  • \(R^2\) tells us model fit6 accounts for 37.6% of the variation in imdb_stars that we observe in our mov2 data.

Performance of the OLS model

model_performance(fit6)
# Indices of model performance

AIC     |    AICc |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------
482.090 | 482.197 | 492.378 | 0.376 |     0.373 | 0.687 | 0.690
  • Our model fit6 assumes that our errors (residuals) come from a Normal distribution with mean 0 and standard deviation Sigma (\(\sigma\)) = 0.69.
  • Thus, about 68% of our predictions should be within \(\pm\) 0.69 stars of the correct outcome, and 95% of our predictions should be within \(\pm 2 \times 0.69\), or 1.38 stars.
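The 68% figure can be checked empirically against the residuals themselves. A minimal sketch using the built-in mtcars data (the movies data aren't reloaded here):

```r
# What fraction of residuals fall within +/- 1 Sigma of zero?
# Sketch with built-in mtcars data, not the movies data.
fit <- lm(mpg ~ wt, data = mtcars)
mean(abs(resid(fit)) <= sigma(fit))
# Expect a value near 0.68 when the residuals are roughly Normal.
```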

Planned Break between Classes 09 and 10.

Performance Measures (1/5)

  • AIC: Akaike’s Information Criterion
  • AICc: Second-order (or small-sample) AIC, which corrects AIC when the sample size is small
  • BIC: Bayesian Information Criterion

AIC, AICc and BIC are used when comparing two or more models for the same outcome. When comparing models fit using maximum likelihood (like OLS linear models), the smaller the AIC or BIC, the better the fit.
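Base R computes AIC and BIC directly, which is a handy cross-check on model_performance(). A sketch with the built-in mtcars data (not the movies data):

```r
# AIC() and BIC() accept several models at once and return a small table.
# Sketch with mtcars; smaller values indicate the better-fitting model.
fit_a <- lm(mpg ~ wt, data = mtcars)
fit_b <- lm(mpg ~ wt + hp, data = mtcars)
AIC(fit_a, fit_b)
BIC(fit_a, fit_b)
```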

Performance Measures (2/5)

R2: r-squared value = 0.376

  • The R-squared (\(R^2\)) measure for an OLS fit describes how much of the variation in our outcome can be explained using our model (and its predictors.) \(R^2\) falls between 0 and 1, and the closer it is to 1, the better the model fits our data.
  • In a simple (one-predictor) OLS model like this, the value is also the square of the Pearson correlation coefficient, \(r\).
  • We called this \(R^2\) “eta-squared” (\(\eta^2\)) in ANOVA.
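The equivalence between \(R^2\) and the squared Pearson correlation is easy to verify for any one-predictor OLS fit. A sketch with the built-in mtcars data (not the movies data):

```r
# In a one-predictor OLS model, R^2 equals the squared Pearson r.
# Sketch with mtcars.
fit <- lm(mpg ~ wt, data = mtcars)
r   <- cor(mtcars$mpg, mtcars$wt)
all.equal(summary(fit)$r.squared, r^2)  # TRUE
```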

Performance Measures (3/5)

R2 (adj.): adjusted r-squared value = 0.373

  • Adjusted R-squared is an index (so it’s not a proportion of anything) for comparing different models (different predictor sets) for the same outcome.
  • The idea is to reduce the temptation to overfit the data, by penalizing the \(R^2\) value a little for each predictor.
  • Adjusted \(R^2\) is usually between 0 and 1, but can be negative.
  • Its formula accounts for the number of observations and the number of predictors in the model.
  • The adjusted \(R^2\) measure can never be larger than \(R^2\).
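That formula is \(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}\), with \(n\) observations and \(p\) predictors, which we can verify by hand. A sketch with the built-in mtcars data (not the movies data):

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).
# Sketch with mtcars: one predictor, so p = 1.
fit <- lm(mpg ~ wt, data = mtcars)
sm  <- summary(fit)
n <- nrow(mtcars); p <- 1
adj <- 1 - (1 - sm$r.squared) * (n - 1) / (n - p - 1)
all.equal(adj, sm$adj.r.squared)  # TRUE
```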

Performance Measures (4/5)

RMSE = 0.687

  • The RMSE is the square root of the variance of the residuals and summarizes the difference between the observed data and the model’s predicted values.
  • It can be interpreted as the standard deviation of the unexplained variance, and has the same units as the outcome.
  • When comparing models using the same data for the same outcome (but, for instance, with different predictor sets), lower RMSE values indicate better model fit.
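The RMSE can be computed by hand from the residuals, which also shows why Sigma is always a bit larger: the RMSE divides the sum of squared residuals by \(n\), while sigma() divides by the residual degrees of freedom, \(n - p - 1\). A sketch with the built-in mtcars data (not the movies data):

```r
# RMSE = sqrt(mean of squared residuals); sigma() uses the residual df
# (n - p - 1) in the denominator, so Sigma is slightly larger than RMSE.
fit  <- lm(mpg ~ wt, data = mtcars)   # sketch with mtcars
rmse <- sqrt(mean(resid(fit)^2))
c(RMSE = rmse, Sigma = sigma(fit))
```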

Performance Measures (5/5)

Sigma = 0.690

  • Linear models assume that their residuals are drawn from a Normal distribution with mean 0 and standard deviation equal to sigma (\(\sigma\)).
  • This indicates that the predicted outcome will be within \(\pm 1 \sigma\) units of the observed outcome for approximately 68% of the data points, for example.

Checking OLS Model Fit

Main assumptions of any simple linear regression are:

  1. linearity: we assume that the outcome is linearly related to our predictor
  2. constant variance (homoscedasticity): we assume that the variation of our outcome is about the same regardless of the value of our predictor
  3. Normal distribution: we assume that the errors around the regression line, at any specified value of our predictor, follow an approximately Normal distribution.

To check these assumptions, consider the following plots.

Fitting the diagnostic plots

fit6_diagnostic_plots <- 
  plot(check_model(fit6, panel = FALSE))
For confidence bands, please install `qqplotr`.
  • I don’t worry about confidence bands in these plots.

Checking Linearity

fit6_diagnostic_plots[[2]]

Checking Equal Variances

fit6_diagnostic_plots[[3]]

Checking for Influential Points

fit6_diagnostic_plots[[4]]

Which points are listed in the plot?

None are anywhere near the contours, as it turns out.

mov2 |> slice(c(41, 32, 69, 130, 183)) |> select(mov_id, movie)
# A tibble: 5 × 2
  mov_id movie                               
  <chr>  <chr>                               
1 M-041  The Dark Knight                     
2 M-032  Chinese Doctors (Zhong guo yi sheng)
3 M-069  The Gingerdead Man                  
4 M-130  Madea Goes To Jail                  
5 M-183  The Room                            

Checking Normality

fit6_diagnostic_plots[[5]]

Alternative: Normal Q-Q

plot(fit6, which = 2)

Which points are low outliers?

mov2 |> slice(c(69, 183, 130)) |> select(mov_id, movie, imdb_stars)
# A tibble: 3 × 3
  mov_id movie              imdb_stars
  <chr>  <chr>                   <dbl>
1 M-069  The Gingerdead Man        3.4
2 M-183  The Room                  3.6
3 M-130  Madea Goes To Jail        4.6
mov2 |> select(mov_id, movie, imdb_stars) |> 
  arrange(imdb_stars) |> head(3)
# A tibble: 3 × 3
  mov_id movie              imdb_stars
  <chr>  <chr>                   <dbl>
1 M-069  The Gingerdead Man        3.4
2 M-183  The Room                  3.6
3 M-130  Madea Goes To Jail        4.6

Posterior Predictive Checks

fit6_diagnostic_plots[[1]]

Box-Cox suggestion?

boxCox(fit6)

imdb_stars squared?

p1 <- ggplot(mov2, aes(x = imdb_ratings/100000, y = imdb_stars)) +
  geom_point() +
  geom_smooth(method = "loess", 
              formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", 
              formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "Hundreds of Thousands of IMDB ratings",
       y = "Weighted average star rating",
       title = "No transformation")

p2 <- ggplot(mov2, aes(x = imdb_ratings/100000, y = imdb_stars^2)) +
  geom_point() +
  geom_smooth(method = "loess", 
              formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", 
              formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "Hundreds of Thousands of IMDB ratings",
       y = "Square of Weighted average star rating",
       title = "Outcome squared")

p1 + p2

imdb_stars squared?

Could we transform either variable?

p3 <- ggplot(mov2, aes(x = imdb_ratings/100000)) +
  geom_histogram(binwidth = 1, col = "white", fill = "dodgerblue") +
  stat_function(fun = function(x)
    dnorm(x, mean = mean(mov2$imdb_ratings/100000, 
                         na.rm = TRUE),
          sd = sd(mov2$imdb_ratings/100000, 
                  na.rm = TRUE)) *
      length(mov2$imdb_ratings/100000) * 1,  
    geom = "area", alpha = 0.5,
    fill = "lightblue", col = "blue") +
  labs(x = "Hundreds of Thousands of IMDB ratings",  y = "",
       title = "IMDB_Ratings / 100000")

p4 <- ggplot(mov2, aes(x = imdb_stars)) +
  geom_histogram(binwidth = 0.2, col = "black", fill = "gold") +
  stat_function(fun = function(x)
    dnorm(x, mean = mean(mov2$imdb_stars,
                         na.rm = TRUE),
          sd = sd(mov2$imdb_stars, 
                  na.rm = TRUE)) *
      length(mov2$imdb_stars) * 0.2,  
    geom = "area", alpha = 0.5,
    fill = "grey80", col = "black") +
  labs(x = "Weighted average star rating", y = "",
       title = "IMDB Stars")

p3 + p4

Could we transform either variable?

Transforming IMDB_ratings?

p3a <- ggplot(mov2, aes(x = log(imdb_ratings))) +
  geom_histogram(binwidth = 0.5, col = "white", fill = "dodgerblue") +
  stat_function(fun = function(x)
    dnorm(x, mean = mean(log(mov2$imdb_ratings), 
                         na.rm = TRUE),
          sd = sd(log(mov2$imdb_ratings), 
                  na.rm = TRUE)) *
      length(log(mov2$imdb_ratings)) * 0.5,  
    geom = "area", alpha = 0.5,
    fill = "lightblue", col = "blue") +
  labs(x = "Log of IMDB ratings",  y = "", 
       title = "Log of IMDB_Ratings")

p3b <- ggplot(mov2, aes(x = sqrt(imdb_ratings))) +
  geom_histogram(binwidth = 100, col = "white", fill = "dodgerblue") +
  stat_function(fun = function(x)
    dnorm(x, mean = mean(sqrt(mov2$imdb_ratings), 
                         na.rm = TRUE),
          sd = sd(sqrt(mov2$imdb_ratings), 
                  na.rm = TRUE)) *
      length(sqrt(mov2$imdb_ratings)) * 100,  
    geom = "area", alpha = 0.5,
    fill = "lightblue", col = "blue") +
  labs(x = "Square Root of IMDB ratings",  y = "", 
       title = "Square Root of IMDB_Ratings")

p3a + p3b

Transforming IMDB_ratings?

New Scatterplot?

p5 <- ggplot(mov2, aes(x = imdb_ratings, y = imdb_stars)) +
  geom_point() +
  geom_smooth(method = "loess", 
              formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", 
              formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "IMDB ratings",
       y = "Weighted average star rating",
       title = "No transformation")

p6 <- ggplot(mov2, aes(x = sqrt(imdb_ratings), y = imdb_stars^2)) +
  geom_point() +
  geom_smooth(method = "loess", 
              formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", 
              formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "Square Root of IMDB Ratings",
       y = "Weighted average star rating",
       title = "Square Root of Predictor")

p5 + p6

New Scatterplot?

Model with \(\sqrt{\mbox{IMDB_ratings}}\)

fit7 <- lm(imdb_stars ~ sqrt(imdb_ratings), data = mov2)

model_parameters(fit7, ci = 0.89)
Parameter           | Coefficient |       SE |       89% CI | t(226) |      p
-----------------------------------------------------------------------------
(Intercept)         |        6.49 |     0.09 | [6.34, 6.64] |  68.94 | < .001
imdb ratings [sqrt] |    1.60e-03 | 1.26e-04 | [0.00, 0.00] |  12.71 | < .001
  • Same problem as before.
  • Tiny slope coefficients are needlessly hard to interpret.

Rescaled \(\sqrt{\mbox{IMDB_ratings}}\) Model

mov4 <- mov4 |>
  mutate(sqrtratK = sqrt(imdb_ratings)/1000)

fit8 <- lm(imdb_stars ~ sqrtratK, data = mov4)
model_parameters(fit8, ci = 0.89)
Parameter   | Coefficient |   SE |       89% CI | t(226) |      p
-----------------------------------------------------------------
(Intercept) |        6.49 | 0.09 | [6.34, 6.64] |  68.94 | < .001
sqrtratK    |        1.60 | 0.13 | [1.40, 1.81] |  12.71 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
  • Suppose we have two movies whose square root of # of IMDB user ratings is 1000 apart. On average, the star rating is 1.60 stars (89% CI 1.40, 1.81) higher for the movie with more IMDB user ratings, according to model fit8.

Scatterplot for model fit8

ggplot(mov4, aes(x = sqrtratK, y = imdb_stars)) +
  geom_point() +
  geom_smooth(method = "loess", 
              formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", 
              formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "(square root of IMDB ratings)/1000",
       y = "Weighted average star rating",
       title = "Transformation for `fit8` model")

Scatterplot for model fit8

Model fit8 vs. fit6 performance

fit6 <- lm(imdb_stars ~ users_100k, data = mov4)
fit8 <- lm(imdb_stars ~ sqrtratK, data = mov4)
model_performance(fit6)
# Indices of model performance

AIC     |    AICc |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------
482.090 | 482.197 | 492.378 | 0.376 |     0.373 | 0.687 | 0.690
model_performance(fit8)
# Indices of model performance

AIC     |    AICc |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------
466.628 | 466.735 | 476.916 | 0.417 |     0.414 | 0.664 | 0.667
  • Which model looks better here?
  • fit8, with lower AIC, BIC, RMSE, Sigma, and higher \(R^2\).
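Base R's AIC() gives the same kind of side-by-side comparison when passed several models at once. A sketch with the built-in mtcars data (not the movies data), comparing a raw and a square-root-transformed predictor in the spirit of fit6 vs. fit8:

```r
# Comparing an untransformed vs. square-root predictor for the same outcome,
# analogous to fit6 vs. fit8. Sketch with mtcars.
fit_raw  <- lm(mpg ~ disp, data = mtcars)
fit_sqrt <- lm(mpg ~ sqrt(disp), data = mtcars)
AIC(fit_raw, fit_sqrt)  # the lower AIC suggests the better fit
```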

Model Checking for fit8

fit8_diagnostic_plots <- 
  plot(check_model(fit8, panel = FALSE))
For confidence bands, please install `qqplotr`.

Checking Linearity

fit8_diagnostic_plots[[2]]

Checking Equal Variances

fit8_diagnostic_plots[[3]]

Checking for Influential Points

fit8_diagnostic_plots[[4]]

Checking Normality

fit8_diagnostic_plots[[5]]

Posterior Predictive Checks

fit8_diagnostic_plots[[1]]

Restricting the Sample???

What if instead of doing a transformation, we only looked at the subset of movies with over 1,000,000 IMDB user ratings?

mov5 <- mov4 |> filter(imdb_ratings > 1000000)
dim(mov5)
[1] 38 34
ggplot(mov5, aes(x = imdb_ratings/100000, y = imdb_stars)) +
  geom_point() +
  geom_smooth(method = "loess", 
              formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", 
              formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "IMDB ratings / 100000",
       y = "Weighted average star rating",
       title = glue(nrow(mov5), " Movies with over 1M IMDB Ratings"))
  • Is this a good idea?

Restricting the Sample???

Labeling the Movies in the Scatterplot

ggplot(mov5, aes(x = imdb_ratings/100000, y = imdb_stars, 
                 label = movie)) +
  geom_point() +
  geom_text_repel() +
  geom_smooth(method = "loess", 
              formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", 
              formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "IMDB ratings / 100000",
       y = "Weighted average star rating",
       title = glue(nrow(mov5), " Movies with over 1M IMDB Ratings"))

Labeling the Movies in the Scatterplot

Our subset of 38 movies

p5a <- ggplot(mov5, aes(x = imdb_ratings/100000)) +
  geom_histogram(binwidth = 1, col = "white", fill = "dodgerblue") +
  stat_function(fun = function(x)
    dnorm(x, mean = mean(mov5$imdb_ratings/100000, 
                         na.rm = TRUE),
          sd = sd(mov5$imdb_ratings/100000, 
                  na.rm = TRUE)) *
      length(mov5$imdb_ratings/100000) * 1,  
    geom = "area", alpha = 0.5,
    fill = "lightblue", col = "blue") +
  labs(x = "Hundreds of Thousands of IMDB ratings",  y = "",
       title = "IMDB_Ratings / 100000")

p5b <- ggplot(mov5, aes(x = imdb_stars)) +
  geom_histogram(binwidth = 0.1, col = "black", fill = "gold") +
  stat_function(fun = function(x)
    dnorm(x, mean = mean(mov5$imdb_stars,
                         na.rm = TRUE),
          sd = sd(mov5$imdb_stars, 
                  na.rm = TRUE)) *
      length(mov5$imdb_stars) * 0.1,  
    geom = "area", alpha = 0.5,
    fill = "grey80", col = "black") +
  labs(x = "Weighted average star rating", y = "",
       title = "IMDB Stars")

p5a + p5b

Our subset of 38 movies

Model fit9 for 38 movies

fit9 <- lm(imdb_stars ~ users_100k, data = mov5)

model_parameters(fit9, ci = 0.89)
Parameter   | Coefficient |       SE |       89% CI | t(36) |      p
--------------------------------------------------------------------
(Intercept) |        7.63 |     0.11 | [7.45, 7.82] | 67.96 | < .001
users 100k  |        0.05 | 6.81e-03 | [0.04, 0.07] |  7.91 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
  • If two movies each have over 1 million IMDB ratings, and the movies have a 100,000 user difference in IMDB ratings, then the movie with more ratings will, on average, have a star rating that is 0.05 stars higher (89% CI: 0.04, 0.07) than the movie with fewer ratings, according to model fit9.

fit9 Performance and Checking

model_performance(fit9)
# Indices of model performance

AIC    |   AICc |    BIC |    R2 | R2 (adj.) |  RMSE | Sigma
------------------------------------------------------------
-6.848 | -6.142 | -1.935 | 0.635 |     0.625 | 0.204 | 0.210
  • These results with 38 movies cannot be compared to our prior results when we included all 228 movies.
fit9_diagnostic_plots <- 
  plot(check_model(fit9, panel = FALSE))

Checking Linearity

fit9_diagnostic_plots[[2]]

Checking Equal Variances

fit9_diagnostic_plots[[3]]

Checking for Influential Points

fit9_diagnostic_plots[[4]]

Checking Normality

fit9_diagnostic_plots[[5]]

Posterior Predictive Checks

fit9_diagnostic_plots[[1]]

Additional Checks

If you’re desperate, there are some tests / checks…

check_heteroscedasticity(fit9)
OK: Error variance appears to be homoscedastic (p = 0.645).
check_outliers(fit9)
OK: No outliers detected.
- Based on the following method and threshold: cook (0.707).
- For variable: (Whole model)
check_normality(fit9)
OK: residuals appear as normally distributed (p = 0.723).

Note

  • Models fit6 and fit8 passed only check_outliers().
  • Models for movies with at least 200K, 300K and 500K IMDB ratings also passed only the outlier check.

Bayesian Model fit10 for 38 movies

Remember: set a seed, switch to stan_glm(), and use refresh = 0.

set.seed(20240926)
fit10 <- stan_glm(imdb_stars ~ users_100k, data = mov5, refresh = 0)
model_parameters(fit10, ci = 0.89)
Parameter   | Median |       89% CI |   pd |  Rhat |     ESS |                 Prior
------------------------------------------------------------------------------------
(Intercept) |   7.63 | [7.45, 7.82] | 100% | 1.000 | 3670.00 | Normal (8.48 +- 0.86)
users_100k  |   0.05 | [0.04, 0.06] | 100% | 1.000 | 3572.00 | Normal (0.00 +- 0.17)
  • If two movies with over 1 million IMDB ratings have a 100,000 user difference in IMDB ratings, then the movie with more ratings will, on average, have a star rating that is 0.05 stars higher (89% CI: 0.04, 0.06) than the movie with fewer ratings, according to fit10.

fit10 Performance and Checking

model_performance(fit10)
# Indices of model performance

ELPD  | ELPD_SE |  LOOIC | LOOIC_SE |   WAIC |    R2 | R2 (adj.) |  RMSE | Sigma
--------------------------------------------------------------------------------
3.286 |   4.555 | -6.572 |    9.109 | -6.636 | 0.620 |     0.598 | 0.204 | 0.214
  • Note that we have some new summaries now. The \(R^2\), RMSE and Sigma values can be compared to fit9 which used the same data. On the whole, fit10 looks slightly worse than fit9 on these metrics.
  • Let’s get the diagnostic plots for fit10.
fit10_diagnostic_plots <- 
  plot(check_model(fit10, panel = FALSE))

Checking Linearity

fit10_diagnostic_plots[[2]]

Checking Equal Variances

fit10_diagnostic_plots[[3]]

Checking for Influential Points

fit10_diagnostic_plots[[4]]

Checking Normality

fit10_diagnostic_plots[[5]]

Posterior Predictive Checks

fit10_diagnostic_plots[[1]]

Not much to choose from here…

fit10 and fit9 are pretty similar in terms of estimated parameters, performance metrics, and diagnostic checks.

check_heteroscedasticity(fit10)
OK: Error variance appears to be homoscedastic (p = 0.626).
check_outliers(fit10)
OK: No outliers detected.
- Based on the following method and threshold: pareto (0.7).
- For variable: (Whole model)

Session Information

xfun::session_info()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  abind_1.4-5          askpass_1.2.0        backports_1.5.0     
  base64enc_0.1-3      bayesplot_1.11.1     bayestestR_0.14.0   
  BH_1.84.0.0          bit_4.0.5            bit64_4.0.5         
  blob_1.2.4           boot_1.3-31          broom_1.0.6         
  bslib_0.8.0          cachem_1.1.0         callr_3.7.6         
  car_3.1-2            carData_3.0-5        cellranger_1.1.0    
  checkmate_2.3.2      cli_3.6.3            clipr_0.8.0         
  coda_0.19-4.1        codetools_0.2-20     colorspace_2.1-1    
  colourpicker_1.3.0   commonmark_1.9.1     compiler_4.4.1      
  conflicted_1.2.0     correlation_0.8.5    cowplot_1.1.3       
  cpp11_0.5.0          crayon_1.5.3         crosstalk_1.2.1     
  curl_5.2.2           data.table_1.16.0    datasets_4.4.1      
  datawizard_0.12.3    DBI_1.2.3            dbplyr_2.5.0        
  Deriv_4.1.3          desc_1.4.3           digest_0.6.37       
  distributional_0.4.0 doBy_4.6.22          dplyr_1.1.4         
  DT_0.33              dtplyr_1.3.1         dygraphs_1.1.1.6    
  easystats_0.7.3      effectsize_0.8.9     emmeans_1.10.4      
  estimability_1.5.1   evaluate_0.24.0      fansi_1.0.6         
  farver_2.1.2         fastmap_1.2.0        fontawesome_0.5.2   
  forcats_1.0.0        fs_1.6.4             gargle_1.5.2        
  generics_0.1.3       ggplot2_3.5.1        ggrepel_0.9.6       
  ggridges_0.5.6       glue_1.7.0           googledrive_2.1.1   
  googlesheets4_1.1.1  graphics_4.4.1       grDevices_4.4.1     
  grid_4.4.1           gridExtra_2.3        gtable_0.3.5        
  gtools_3.9.5         haven_2.5.4          highr_0.11          
  hms_1.1.3            htmltools_0.5.8.1    htmlwidgets_1.6.4   
  httpuv_1.6.15        httr_1.4.7           ids_1.0.1           
  igraph_2.0.3         infer_1.0.7          inline_0.3.19       
  insight_0.20.4       isoband_0.2.7        janitor_2.2.0       
  jquerylib_0.1.4      jsonlite_1.8.8       kableExtra_1.4.0    
  knitr_1.48           labeling_0.4.3       later_1.3.2         
  lattice_0.22-6       lazyeval_0.2.2       lifecycle_1.0.4     
  lme4_1.1-35.5        loo_2.8.0            lubridate_1.9.3     
  magrittr_2.0.3       markdown_1.13        MASS_7.3-61         
  Matrix_1.7-0         MatrixModels_0.5.3   matrixStats_1.4.0   
  memoise_2.0.1        methods_4.4.1        mgcv_1.9-1          
  microbenchmark_1.5.0 mime_0.12            miniUI_0.1.1.1      
  minqa_1.2.8          modelbased_0.8.8     modelr_0.1.11       
  multcomp_1.4-26      munsell_0.5.1        mvtnorm_1.3-1       
  naniar_1.1.0         nlme_3.1-164         nloptr_2.1.1        
  nnet_7.3.19          norm_1.0.11.1        numDeriv_2016.8.1.1 
  openssl_2.2.1        parallel_4.4.1       parameters_0.22.2   
  patchwork_1.2.0      pbkrtest_0.5.3       performance_0.12.3  
  pillar_1.9.0         pkgbuild_1.4.4       pkgconfig_2.0.3     
  plyr_1.8.9           posterior_1.6.0      prettyunits_1.2.0   
  processx_3.8.4       progress_1.2.3       promises_1.3.0      
  ps_1.7.7             purrr_1.0.2          quantreg_5.98       
  QuickJSR_1.3.1       R6_2.5.1             ragg_1.3.2          
  rappdirs_0.3.3       RColorBrewer_1.1.3   Rcpp_1.0.13         
  RcppEigen_0.3.4.0.2  RcppParallel_5.1.9   readr_2.1.5         
  readxl_1.4.3         rematch_2.0.0        rematch2_2.1.2      
  report_0.5.9         reprex_2.1.1         reshape2_1.4.4      
  rlang_1.1.4          rmarkdown_2.28       rstan_2.32.6        
  rstanarm_2.32.1      rstantools_2.4.0     rstudioapi_0.16.0   
  rvest_1.0.4          sandwich_3.1-0       sass_0.4.9          
  scales_1.3.0         see_0.9.0            selectr_0.4.2       
  shiny_1.9.1          shinyjs_2.1.0        shinystan_2.6.0     
  shinythemes_1.2.0    snakecase_0.11.1     sourcetools_0.1.7.1 
  SparseM_1.84.2       splines_4.4.1        StanHeaders_2.32.10 
  stats_4.4.1          stats4_4.4.1         stringi_1.8.4       
  stringr_1.5.1        survival_3.7-0       svglite_2.1.3       
  sys_3.4.2            systemfonts_1.1.0    tensorA_0.36.2.1    
  textshaping_0.4.0    TH.data_1.1-2        threejs_0.3.3       
  tibble_3.2.1         tidyr_1.3.1          tidyselect_1.2.1    
  tidyverse_2.0.0      timechange_0.3.0     tinytex_0.52        
  tools_4.4.1          tzdb_0.4.0           UpSetR_1.4.0        
  utf8_1.2.4           utils_4.4.1          uuid_1.2.1          
  V8_5.0.0             vctrs_0.6.5          viridis_0.6.5       
  viridisLite_0.4.2    visdat_0.6.0         vroom_1.6.5         
  withr_3.0.1          xfun_0.47            xml2_1.3.6          
  xtable_1.8-4         xts_0.14.0           yaml_2.3.10         
  zoo_1.8-12